Constructing a Parsed Corpus with a Large LFG Grammar

Authors

  • Victoria Rosén
  • Paul Meurer
  • Koenraad de Smedt
  • Miriam Butt
Abstract

The TREPIL project (Norwegian Treebank Pilot Project, 2004-2008) aims to develop and test methods for the construction of a Norwegian parsed corpus. Annotation of c-structures, f-structures and mrs-structures is based on automatic parsing with human validation and disambiguation. Parsing is done with a large LFG grammar and the XLE parser. We propose a method for efficient disambiguation based on discriminants, and we have implemented a set of computational tools for this purpose.

1 Treebanks and parsed corpora

We use the term treebank to mean a corpus annotated with sentence structures beyond the part-of-speech level. Even though the term refers to syntactic tree structures, in current usage it extends to corpora with all kinds of structural annotation at syntactic and even semantic levels, such as constituent structures, grammatical functions, or predicate-argument relations (Nivre, De Smedt, and Volk, 2005). Our work in the context of the TREPIL project (Norwegian Treebank Pilot Project, 2004-2008) aims to develop and test methods for the construction of a Norwegian treebank based on deep parsing. Before going into details about this project and its results so far, we first provide some background on previous related work.

Currently, linguists and language engineers have easy access to large text and speech corpora, many of which are annotated at the word level, mostly by parts of speech. Although searching large corpora for certain words and sequences of words with given categories may yield valuable information, two problems can be discerned (Abeillé, 2003). Firstly, part-of-speech tagging is of limited use to syntacticians, as it fails to distinguish boundaries of clauses, of phrases, and even of compound words that are written separately. Secondly, as automatic part-of-speech tagging is normally based on shallow analysis or statistical processing, the quality of the annotated corpus is likely to be unsatisfactory.
To overcome the limitations of corpora with word-level annotation only, efforts have been made towards more sophisticated linguistic annotation of corpora. Whereas the first syntactically annotated corpora were developed mostly with manual methods, the development of more sophisticated linguistic models prompted the application of such models to treebank construction (Abeillé, 2003). This has led to the term parsed corpus, which is usually reserved for a treebank that is grounded in a computational grammar model.

Treebank construction on the basis of automatic parsing with a computational grammar is desirable for both practical and theoretical reasons. Manual annotation has the disadvantage of being costly and prone to human error, and it is difficult to achieve satisfactory consistency both within and between human annotators (van der Beek et al., 2002a). Moreover, an annotation scheme which is only verbally defined and not grounded in a computational grammar model risks isolating the corpus from the very applications for which it could be useful.

Fully automatic annotation, on the other hand, only works to the extent that the analyses chosen by the parser are correct. Since perfect coverage is not attainable in practice, many current approaches to treebank construction are semi-automatic, in the sense that parser output is validated by a human annotator. Furthermore, automatic parsing usually produces more than one possible analysis, since many sentences can be analyzed in a variety of ways that may be infelicitous but can neither be excluded on purely syntactic grounds nor completely avoided by statistical learning techniques. Therefore, manual disambiguation is a necessity. In the Alpino treebank, for instance, the corpus is automatically parsed with a dependency grammar, assisted by interactive tools for manual checking, including disambiguation and extension of the lexicon (van der Beek et al., 2002b).
The TREPIL project has devoted significant effort to disambiguation methods, as discussed below.

Several treebank projects have prominently used the LFG and HPSG formalisms. The PARC 700 Dependency Bank (King et al., 2003) was constructed in a two-step approach. First, a corpus was parsed with an LFG grammar and the best parse for each sentence was chosen manually and stored. Then, from each stored functional structure, a corresponding dependency structure was automatically derived, modified as needed, and validated by a human annotator. The PARC 700 has subsequently been used as an external standard in the evaluation of other f-structure annotations (Burke et al., 2004b). One of the points illustrated by the PARC 700 is that a treebank constructed by parsing with a certain language model (in this case, based on the LFG formalism) can nevertheless be converted into different linguistic models. A different example of treebanking by conversion is the derivation of Estonian phrase structures on the basis of Constraint Grammar function tags (Bick, Uibo, and Müürisep, 2004). It may perhaps be concluded that aspirations for neutrality with respect to grammatical theory are as unnecessary as they are illusory.

Treebanks are currently receiving a lot of attention because they provide highly valuable empirical data for many research questions in linguistics and language technology (Nivre, De Smedt, and Volk, 2005). Provided they are composed and annotated as reference corpora rather than special-purpose collections, treebanks allow for multiple uses in the various sciences of language as well as in language technology. Linguists may want to search for examples or counterexamples of syntactic constructions under investigation, whereas psycholinguists may be interested in relative frequencies of various possible attachments of prepositional phrases or relative clauses (Abeillé, 2003).
Formal and computational linguists can evaluate the correctness and coverage of grammars and lexicons against the analyses stored in a treebank, and at a more general level, the adequacy of linguistic theories and formalisms can be assessed (Bouma, 2004). From the grammatical information stored in treebanks, other resources such as grammars and lexicons can be induced. Stochastic grammars can be trained using frequency information about the parse choices. Other research has focused on the induction of LFG grammars from existing manually annotated treebanks, with the goal of deriving robust, wide-coverage grammars from treebanks rather than having to hand-code them (Burke et al., 2004a). Claims that the performance of automatically induced LFG grammars may surpass that of hand-coded LFG grammars (Cahill, 2004) have to be weighed against questions concerning the generality and explanatory power of the linguistic theories embodied by the induced grammars.

More important than the particular formalism used is the level of detail of the analyses in treebanks. While some of the earliest syntactically annotated corpora contain only syntactic boundaries, others contain for instance constituent structures (Abeillé, Clément, and Toussenel, 2003), functional dependency structures (Hajič, 1998) or, in addition to syntactic structures, also predicate-argument structures (Marcus et al., 1994). TREPIL is cooperating with the LOGON project on Norwegian-English machine translation (Oepen et al., 2004), which has produced a small treebank containing semantic structures. In the context of translation and contrastive linguistics, we also want to mention the potential of parallel treebanks of translated texts, where detailed and deep analyses offer an interesting domain of study.

2 Treebanking goals in the TREPIL project

The TREPIL project is a research project on treebanking methods, aimed at building a Norwegian parsed corpus.
The current project is a preparatory project; it will not produce a full-scale treebank, but a methodology, a set of computational tools, and a demonstration corpus. Our hope is that the resulting methods and tools will be put to use in a subsequent project for building a large-scale Norwegian treebank which will form part of a future Norsk Språkbank (Norwegian Language Bank).

The TREPIL project uses the LFG formalism and explores a tight relation between a grammar and a corpus, but our focus is different from that of earlier LFG-banking projects. Our method for treebank construction is based on the testing and further improvement of an existing hand-coded grammar and parser, and its extension with additional treebanking tools, primarily for disambiguation. For this purpose, we use the NorGram LFG grammar for Norwegian, together with the Xerox Linguistic Environment (XLE). Our motivation for using NorGram as a starting point is twofold. Firstly, NorGram is currently the only deep grammar for Norwegian with large coverage. Secondly, the grammar is developed in the international ParGram project (Butt et al., 2002), which attains a certain level of generality across languages through agreements on similar feature structures and the existence of a transfer formalism for f-structure-based translation. However, we do not want to overemphasize the choice of formalism, for reasons outlined above.

An innovative characteristic of TREPIL is that, in contrast to the single-stratum approaches of most other treebanks, the Norwegian grammar generates three separate but interrelated structures for each sentence: a constituent structure, a functional structure, and a semantic structure. The semantic projection is based on Minimal Recursion Semantics (MRS) (Copestake et al., in preparation), which allows a deeper level of semantic description than the predicate-argument coding in the Penn Treebank (Marcus et al., 1994).
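To make the triple annotation concrete, the following is a minimal sketch of the three interrelated structures for a simple Norwegian sentence. All field names and the encoding are invented for illustration; this is not NorGram's or XLE's actual output format.

```python
# Hypothetical, simplified rendering of the three projections for the
# Norwegian sentence "Kari sover" ("Kari sleeps"). The representation is
# an illustrative assumption, not NorGram/XLE output.

# c-structure: a constituent tree as nested tuples
c_structure = ("IP",
               ("NP", ("PROP", "Kari")),
               ("VP", ("V", "sover")))

# f-structure: an attribute-value matrix as a nested dict
f_structure = {
    "PRED": "sove<SUBJ>",
    "TENSE": "pres",
    "SUBJ": {"PRED": "Kari", "NUM": "sg", "PERS": "3"},
}

# mrs-structure: a flat bag of elementary predications with underspecified
# scope; each predication is given as (handle, predicate, arguments), with
# qeq handle constraints relating holes to labels
mrs_structure = {
    "top": "h0",
    "eps": [
        ("h1", "proper_q", ("x", "h2", "h3")),
        ("h4", "named(Kari)", ("x",)),
        ("h5", "sove_v", ("e", "x")),
    ],
    "hcons": {"h2": "h4", "h0": "h5"},
}

assert f_structure["SUBJ"]["PRED"] == "Kari"
```

The point of the sketch is that the three levels are separate objects: the f-structure abstracts away from constituent order, and the mrs-structure's bag of predications leaves relative scope open, as described below.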
MRS represents the semantics of a sentence as a bag of elementary predications, underspecified for scope. The mrs-structures are derived by co-description and may contain information that cannot be derived from the corresponding f-structures, so that the mrs-projection represents an autonomous level of structure. The triple-stratum annotation generated by our grammar represents a rich, layered description of the syntax and semantics of each sentence, which allows for multiple uses. However, this sophistication comes at a price: disambiguation and validation are nontrivial, and manual annotation would be quite difficult and inefficient. Thus, our treebanking method is strongly dependent on computational systems, including an efficient parser and a grammar whose coverage and precision are being continually improved.

One sometimes comes across scepticism with respect to the possibility of deep (full) parsing, often from adherents of shallow parsing as an approximation. However, it has also been pointed out that much of this scepticism is unwarranted for the XLE parser (Zaenen, 2004). Although full parsers may be slower, the XLE parser is still fast enough for off-line parsing of a corpus. Also, it has been pointed out that some full parsers are too brittle to deal with anomalous input, but the XLE parser allows fragment parses, so any input can receive some form of analysis. It has also been claimed that full parsers may yield so many parses that applications have difficulty coping with them. While this problem is being addressed by the inclusion of probability-based disambiguation components, such automatic disambiguation is not feasible in our situation, since it would need to be bootstrapped from a treebank which is not yet built. Instead, we are focusing our attention on an altogether different method aimed at highly efficient manual disambiguation.
This method, which will be the primary focus of the next section of this paper, is supported by a computational tool which we have implemented in a working first version. Also, we are working towards a system which allows the automatic reanalysis of the corpus as the grammar develops.

Insofar as TREPIL involves the synchronous evolution of a treebank and a grammar, our approach is similar to that of LinGO Redwoods (Oepen et al., 2003), which is based on HPSG and the LKB parser environment. LinGO has developed a set of advanced tools that allow the automated update of the treebank after reparsing with a new version of the grammar, but without having to fully disambiguate the corpus over again. This is achieved by reapplying earlier recorded choices made by the annotator in the selection of the preferred parse, based on techniques proposed by Carter (1997). A crucial point is that not only the preferred analysis of the sentence is recorded, but all decisions made as part of the annotation are stored in the database.

We believe there are important methodological advantages to our approach. Instead of building a treebank incrementally and improving the grammar independently, we develop an efficient way to successively reannotate the corpus with each version of the grammar, thus obtaining a parsed corpus that is fully consistent with the grammar. The end result is therefore not only a treebank, but also a grammar that can be deployed in other applications, for example machine translation, especially since it produces semantic analyses. In view of such applications, we believe it is advantageous to retain manual control over the grammar in order to obtain the kind of abstraction and readability required by a linguist, rather than inducing an entirely new grammar from the treebank.

3 Disambiguation with XLE

One of the main challenges in using parser output for treebanking is selecting the desired parse among a potentially large number of parses.
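The reuse of recorded annotator decisions in the spirit of Carter (1997), discussed above, can be sketched as follows. Representing an analysis simply as the set of discriminant properties it exhibits is an assumption for illustration, not the actual LinGO or TREPIL data model, and the property names are invented.

```python
# Sketch of reapplying stored discriminant decisions after reparsing
# with a new grammar version. An analysis is modeled as a set of
# discriminant properties; a stored decision maps a discriminant to
# True (accepted by the annotator) or False (rejected).

def reapply_decisions(analyses, decisions):
    """Return ids of analyses consistent with every still-applicable decision.

    analyses:  dict of analysis id -> set of discriminant properties
    decisions: dict of discriminant -> bool
    A decision is only applied if its discriminant still occurs in the
    new parse forest; obsolete decisions are silently skipped.
    """
    present = set().union(*analyses.values()) if analyses else set()
    applicable = {d: a for d, a in decisions.items() if d in present}
    return [aid for aid, props in analyses.items()
            if all((d in props) == a for d, a in applicable.items())]

# After reparsing, two analyses differ in PP attachment; the stored
# decision that the PP attaches to the verb still singles out analysis 1.
new_forest = {1: {"PP-attach:verb", "num:sg"},
              2: {"PP-attach:noun", "num:sg"}}
stored = {"PP-attach:verb": True}
assert reapply_decisions(new_forest, stored) == [1]
```

The design point is that storing all decisions, not just the winning analysis, is what makes this filtering possible: each recorded choice can be rechecked against a new forest independently of the others.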
It is worth remembering that the number of parses is exponential in the number of ambiguities, such that up to 2^n analyses may be produced for n binary choices. Six unresolved, independent parsing choices can for instance give rise to 64 analyses. Consequently, a disambiguation strategy that concentrates on local ambiguities may be more efficient than one that only looks at the whole set of resulting analyses.

XLE has a built-in facility for disambiguation in the form of packed representations. When a sentence is parsed, XLE displays the analyses one at a time in the c-structure and f-structure windows. This allows the user to browse through all the analyses and inspect each c-structure and its corresponding f-structures in turn. In addition, f-structure chart windows show packed representations of all analyses. There are two different formats in which this compact information is shown, but we will concentrate on the f-structure chart window that indexes the analyses by constraints, providing a view of choices listed as alternatives. When a sentence contains a single ambiguity, this type of representation makes it easy to spot the source of the ambiguity, as shown in figure 1 for example 1.
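The combinatorics behind local disambiguation can be made explicit with a trivial calculation: n independent binary ambiguities multiply into up to 2^n full analyses, while choice-by-choice disambiguation requires at most n yes/no decisions.

```python
# n independent binary parsing choices can yield up to 2**n complete
# analyses, but disambiguating locally, one discriminant at a time,
# requires at most n decisions.

def max_analyses(n_binary_choices: int) -> int:
    return 2 ** n_binary_choices

def local_decisions(n_binary_choices: int) -> int:
    return n_binary_choices

# Six unresolved choices, as in the running example in the text:
assert max_analyses(6) == 64
assert local_decisions(6) == 6
```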




Publication date: 2005